Abstract

Gene assembly in ciliates is one of the most involved DNA processings going on in 
any organism. This process transforms one nucleus (the micronucleus) into another 
functionally different nucleus (the macronucleus). We continue the development
of the theoretical models of gene assembly, and in particular we demonstrate the 
use of the concept of the breakpoint graph, known from another branch of DNA 
transformation research. More specifically: (1) we characterize the intermediate gene 
patterns that can occur during the transformation of a given micronuclear gene 
pattern to its macronuclear form; (2) we determine the number of applications of the 
loop recombination operation (the most basic of the three molecular operations that 
accomplish gene assembly) needed in this transformation; (3) we generalize previous 
results (and give elegant alternatives for some proofs) concerning characterizations 
of the micronuclear gene patterns that can be assembled using a specific subset of 
the three molecular operations. 



1 Introduction 



Ciliates are single cell organisms that have two functionally different nuclei, 
one called micronucleus and the other called macronucleus (both of which 
can occur in various multiplicities). At some stage in sexual reproduction a 
micronucleus is transformed into a macronucleus in a process called gene
assembly. This is the most involved DNA processing in living organisms known
today. The reason that gene assembly is so involved is that the genome of the 
micronucleus may be dramatically different from the genome of the macronu- 
cleus — this is particularly true in the stichotrichs group of ciliates, which we 
consider in this paper. The investigation of gene assembly turns out to be very 
exciting from both biological and computational points of view. 

Another research area concerned with transformations of DNA is sorting by 
reversal, see, e.g., [10,8,1]. Two different species can have several contiguous 
segments in their genome that are very similar, although their relative order
(and orientation) may differ in both genomes. In the theory of sorting by 
reversal one tries to determine the number of operations needed to reorder 
such a series of genomic 'blocks' from one species into that of another. An 
essential tool is the breakpoint graph (or reality and desire diagram) which is 
used to capture both the present situation, the genome of the first species, 
and the desired situation, the genome of the second species. 

Motivated by the breakpoint graph, we introduce the notion of reduction graph 
into the theory of gene assembly. The intuition of 'reality and desire' remains 
in place, but the technical details are different. Instead of one operation, the re- 
versal, we have three operations. Furthermore, these operations are irreversible 
and can only be applied on special positions in the string, called pointers. Also, 
instead of two different species, we deal with two different nuclei — the real- 
ity is a gene in its micronuclear form, and desire is the same gene but in its 
macronuclear form. Surprisingly, where the breakpoint graph in the theory of 
sorting by reversal is mostly useful to determine the number of needed oper- 
ations, the reduction graph has different uses in the theory of gene assembly, 
providing valuable insights into the gene assembly process. Adapted from the 
theory of sorting by reversal, and applied to the theory of gene assembly in 
ciliates, we hope the reduction graph can serve as a 'missing link' to connect 
the two fields. 

For example, the reduction graph allows for a direct characterization of the 
intermediate strings that may be constructed during the transformation of a 
given gene from its micronuclear form to its macronuclear form (Theorem 18). 
Also, it makes the number of loop recombination operations (see Figure 3 
below) needed in this transformation quite explicit as the number of cyclic 
(connected) components in the reduction graph (Theorem 26). 

Each micronuclear form of a gene defines a sequence of (oriented) segments, 
the boundaries of which define the pointers where splicing takes place. In 
abstract representation, the gene defines a so-called realistic string in which 
every pointer is denoted by a single symbol. Each pointer occurs twice (up 
to inversion) in that string. Not every string in which each symbol has two
occurrences (up to inversion) can be obtained as the representation of a
micronuclear gene. Our results are obtained in the larger context, i.e., they are
not only valid for realistic strings, but for legal strings in general.

Fig. 1. The MAC form of genes.

Fig. 2. The MIC form of genes.



The paper is organized as follows. In Section 2 we briefly discuss the basics 
of gene assembly in ciliates, and describe three molecular operations stipu- 
lated to accomplish gene assembly. The reader is referred to monograph [4] 
for more background information. In Section 3 we recall some basic notions 
and notation concerning strings and graphs, and then in Section 4 we recall 
the string pointer reduction system, which is a formal model of gene assembly. 
This model is used throughout the rest of this paper. In Section 5 we intro- 
duce the operation of pointer removal, which forms a useful formal tool in this 
paper. Then in Sections 6 and 7 we introduce our main construct, the reduc- 
tion graph, and discuss the transformations of it that correspond to the three 
molecular operations. In Section 8 we provide a characterization of intermediate
forms of a gene resulting from its assembly to the macronuclear form;
then, in Section 9 we determine the number of loop recombination operations 
required in this assembly. As an application of this last result, in Section 10 we 
generalize some well-known results from [5] (and Chapter 13 in [4]) as well as 
give elegant alternatives for these proofs. A conference edition of this paper, 
containing selected results without proofs, was presented at CompLife [2]. 



2 Background: Gene Assembly in Ciliates 



This section discusses the biological origin for the string pointer reduction 
system, the formal model we discuss in Section 4 and use throughout this 
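paper. Let us recall that the inversion of a double stranded DNA sequence
$M$, denoted by $\bar{M}$, is the point rotation of $M$ by 180 degrees. For example, if

$$M = \begin{matrix} \mathtt{GACGT} \\ \mathtt{CTGCA} \end{matrix}, \quad \text{then} \quad \bar{M} = \begin{matrix} \mathtt{ACGTC} \\ \mathtt{TGCAG} \end{matrix}.$$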
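The inversion can be sketched in a few lines of Python. This is only an illustration: the representation of a double-stranded molecule as a pair of strings (top strand, bottom strand) and the helper names `invert` and `is_paired` are assumptions made here, not notation from the paper.

```python
# Watson-Crick pairing, used only to check that the two strands match.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def invert(top, bottom):
    """Rotate a double-stranded molecule by 180 degrees.

    The rotated top strand is the old bottom strand read backwards,
    and the rotated bottom strand is the old top strand read backwards.
    """
    return bottom[::-1], top[::-1]

def is_paired(top, bottom):
    """Check that the strands are complementary position by position."""
    return len(top) == len(bottom) and all(
        COMPLEMENT[a] == b for a, b in zip(top, bottom)
    )

M = ("GACGT", "CTGCA")
M_bar = invert(*M)   # the molecule from the example above, inverted
```

Applying `invert` twice returns the original pair, which mirrors the fact that inverting an inverted sequence gives back the sequence itself.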

Ciliates are unicellular organisms (eukaryotes) that have two kinds of function- 
ally different nuclei: the micronucleus (MIC) and the macronucleus (MAC). 

Fig. 3. The loop recombination operation.

Fig. 4. The hairpin recombination operation.

All the genes occur in both MIC and MAC, but in very different forms. For 
a given individual gene (in given species) the relationship between its MAC 
and MIC form can be described as follows. 

The MAC form $G$ of a given gene can be represented as the sequence $M_1, M_2, \ldots, M_\kappa$
of overlapping segments (called MDSs) which form $G$ in the way shown in Figure 1
(where the overlaps are given by the shaded areas). The MIC form $g$ of
the same gene is formed by a specific permutation $M_{i_1}, \ldots, M_{i_\kappa}$ of $M_1, \ldots, M_\kappa$,
in the way shown in Figure 2, where $I_1, I_2, \ldots, I_{\kappa-1}$ are segments of DNA
(called IESs) inserted in-between segments $\tilde{M}_{i_1}, \ldots, \tilde{M}_{i_\kappa}$ with each $\tilde{M}_i$ equal
to either $M_i$ or $\bar{M}_i$ (the inversion of $M_i$). As is clear from Figure 1, each MDS
$M_i$ except for $M_1$ and $M_\kappa$ (the first and the last one) begins with the overlap
with $M_{i-1}$ and ends with the overlap with $M_{i+1}$; these overlaps are
called pointers. The former is the incoming pointer of $M_i$, denoted by $p_i$, and
the latter is the outgoing pointer of $M_i$, denoted by $p_{i+1}$. Then $M_1$ has only
the outgoing pointer $p_2$, and $M_\kappa$ has only the incoming pointer $p_\kappa$.

The MAC is the (standard eukaryotic) 'household' nucleus that provides RNA 
transcripts for the expression of proteins — hence MAC genes are functional 
expressible genes. On the other hand the MIC is a dormant nucleus where 
no production of RNA transcripts occurs. As a matter of fact MIC becomes 
active only during sexual reproduction. Within a part of sexual reproduction 
in a process called gene assembly, MIC genes are transformed into MAC genes 
(as MIC is transformed into MAC). In this transformation the IESs from the 
MIC gene g (see Figure 2) must be excised and the MDSs must be spliced 
(overlapping on pointers) in their order $M_1, \ldots, M_\kappa$ to form the MAC gene $G$
(see Figure 1). 

The gene assembly process is accomplished through the following three molec- 
ular operations, which through iterative applications beginning with the MIC 
form g of a gene, and going through intermediate forms, lead to the formation 
of the MAC form G of the gene. 



Fig. 5. The double-loop recombination operation.

Loop recombination The effect of the loop recombination operation is
illustrated in Figure 3. The operation is applicable to a gene pattern (i.e.,
MIC or an intermediate form of a gene) which has two identical pointers p,
p separated by a single IES y. The application of this operation results in
the excision from the DNA molecule of a circular molecule consisting of y
(and a copy of the involved pointer) only.

Hairpin recombination The effect of the hairpin recombination operation 
is illustrated in Figure 4. The operation is applicable to a gene pattern 
containing a pair of pointers $p$, $\bar{p}$ in which one pointer is an inversion of the
other. The application of this operation results in the inversion of the DNA 
molecule segment that is contained between the mentioned pair of pointers. 

Double-loop recombination The effect of the double-loop recombination 
operation is illustrated in Figure 5. The operation is applicable to a gene 
pattern containing two identical pairs of pointers for which the segment of 
the molecule between the first pair of pointers overlaps with the segment of 
the molecule between the second pair of pointers. The application of this 
operation results in interchanging the segment of the molecule between the 
first two (of the four) pointers in the gene pattern and the segment of the 
molecule between the last two (of the four) pointers in the gene pattern. 

For a given MIC gene g, a sequence of (applications of) these molecular op- 
erations is successful if it transforms g into its MAC form G. The gluing of 
MDS $M_j$ with MDS $M_{j+1}$ on the common pointer $p_{j+1}$ results in a composite
MDS. This means that after gluing, the outgoing pointer of $M_j$ and the
incoming pointer of $M_{j+1}$ are not pointers anymore, because pointers are always
positioned on the boundary of MDSs (hence they are adjacent to IESs).
Therefore, the molecular operations can be seen as operations that remove 
pointers. This is an important property of gene assembly which is crucial in 
the formal models of the gene assembly process (see [4]). 



3 Basic Notions and Notation 



In this section we recall some basic notions concerning functions, strings, and 
graphs. We do this mainly to set up the basic notation and terminology for 
this paper. 

The empty set will be denoted by $\emptyset$. The composition of functions $f : X \to Y$
and $g : Y \to Z$ is the function $gf : X \to Z$ such that $(gf)(x) = g(f(x))$ for
every $x \in X$. The restriction of $f$ to a subset $A$ of $X$ is denoted by $f|A$.



We will use $\lambda$ to denote the empty string. For strings $u$ and $v$, we say that $v$
is a substring of $u$ if $u = w_1 v w_2$, for some strings $w_1$, $w_2$; we also say that $v$
occurs in $u$. For a string $x = x_1 x_2 \cdots x_n$ over $\Sigma$ with $x_1, x_2, \ldots, x_n \in \Sigma$, we say
that substrings $x_{i_1} \cdots x_{j_1}$ and $x_{i_2} \cdots x_{j_2}$ of $x$ overlap in $x$ if $i_1 < i_2 < j_1 < j_2$
or $i_2 < i_1 < j_2 < j_1$.

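For alphabets $\Sigma$ and $\Delta$, a homomorphism is a function $\varphi : \Sigma^* \to \Delta^*$ such that
$\varphi(xy) = \varphi(x)\varphi(y)$ for all $x, y \in \Sigma^*$. Let $\varphi : \Sigma^* \to \Delta^*$ be a homomorphism.
If there is a $\Gamma \subseteq \Sigma$ such that

$$\varphi(x) = \begin{cases} x & \text{if } x \notin \Gamma \\ \lambda & \text{if } x \in \Gamma \end{cases}$$

then $\varphi$ is denoted by $erase_\Gamma$.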

We move now to graphs. A labelled graph is a 4-tuple $G = (V, E, f, \Psi)$, where
$V$ is a finite set, $\Psi$ is an alphabet, $E$ is a finite subset of $V \times \Psi^* \times V$, and
$f : D \to \Gamma$, for some $D \subseteq V$ and some alphabet $\Gamma$, is a partial function on $V$.
The elements of $V$ are called vertices, and the elements of $E$ are called edges.
Function $f$ is the vertex labelling function, the elements of $\Gamma$ are the vertex
labels, and the elements of $\Psi^*$ are the edge labels.

For $e = (x, u, y) \in V \times \Psi^* \times V$, $x$ is called the initial vertex of $e$, denoted by
$\iota(e)$, $y$ is called the terminal vertex of $e$, denoted by $\tau(e)$, and $u$ is called the
label of $e$, denoted by $\ell(e)$. Labelled graph $G' = (V', E', f|V', \Psi)$ is an induced
subgraph of $G$ if $V' \subseteq V$ and $E' = E \cap (V' \times \Psi^* \times V')$. We also say that $G'$ is
the subgraph of $G$ induced by $V'$.

A walk in $G$ is a string $\pi = e_1 e_2 \cdots e_n$ over $E$ with $n \ge 1$ such that $\tau(e_i) =
\iota(e_{i+1})$ for $1 \le i < n$. The label of $\pi$ is the string $\ell(\pi) = \ell(e_1)\ell(e_2) \cdots \ell(e_n)$.
Vertex $\iota(e_1)$ is called the initial vertex of $\pi$, denoted by $\iota(\pi)$, vertex $\tau(e_n)$ is
called the terminal vertex of $\pi$, denoted by $\tau(\pi)$, and we say that $\pi$ is a walk
between $\iota(\pi)$ and $\tau(\pi)$ (or that $\pi$ is a walk from $\iota(\pi)$ to $\tau(\pi)$). We say that $G$
is weakly connected if for every two vertices $v_1$ and $v_2$ of $G$ with $v_1 \ne v_2$, there
is a string $e_1 e_2 \cdots e_n$ over $E \cup \{(\tau(e), \ell(e), \iota(e)) \mid e \in E\}$ with $n \ge 1$, $\iota(e_1) = v_1$,
$\tau(e_n) = v_2$, and $\tau(e_i) = \iota(e_{i+1})$ for $1 \le i < n$. A subgraph $H$ of $G$ induced
by $V_H \subseteq V$ is a component of $G$ if $H$ is weakly connected, and for every edge
$e \in E$ either $\iota(e), \tau(e) \in V_H$ or $\iota(e), \tau(e) \in V \setminus V_H$.

The isomorphism between two labelled graphs is defined in the usual way. 
Two labelled graphs $G = (V, E, f, \Psi)$ and $G' = (V', E', f', \Psi)$ are isomorphic,
denoted by $G \approx G'$, if there is a bijection $\alpha : V \to V'$ such that $f(v) = f'(\alpha(v))$
for all $v \in V$, and

$(x, u, y) \in E$ iff $(\alpha(x), u, \alpha(y)) \in E'$,

for all $x, y \in V$ and $u \in \Psi^*$. The bijection $\alpha$ is then called an isomorphism
from $G$ to $G'$.

In this paper we will consider walks in labelled graphs that often originate in 
a fixed source vertex and will end in a fixed target vertex. Therefore, we need 
the following notion. 

A two-ended graph is a 6-tuple $G = (V, E, f, \Psi, s, t)$, where $(V, E, f, \Psi)$ is a
labelled graph, $f$ is a function on $V \setminus \{s, t\}$ and $s, t \in V$ where $s \ne t$. Vertex $s$
is called the source vertex of $G$ and vertex $t$ is called the target vertex of $G$.
The basic notions and notation for labelled graphs carry over to two-ended
graphs. However, for the notion of isomorphism, care must be taken that the
two ends are preserved. Thus, if $G$ and $G'$ are two-ended graphs, and $\alpha$ is an
isomorphism from $G$ to $G'$, then $\alpha(s) = s'$ and $\alpha(t) = t'$, where $s$ ($s'$, resp.)
is the source vertex of $G$ ($G'$, resp.) and $t$ ($t'$, resp.) is the target vertex of $G$
($G'$, resp.).



4 The String Pointer Reduction System 



In this paper we consider the string pointer reduction system, which we will 
recall now (see also [3] and Chapter 9 in [4]). 

We fix $\kappa \ge 2$, and define the alphabet $\Delta = \{2, 3, \ldots, \kappa\}$. For $D \subseteq \Delta$, we define
$\bar{D} = \{\bar{a} \mid a \in D\}$ and $\Pi_D = D \cup \bar{D}$; also $\Pi = \Pi_\Delta$. We will use the alphabet $\Pi$
to formally denote the pointers: the intuition is that the pointer $p_i$ will be
denoted by either $i$ or $\bar{i}$. Accordingly, elements of $\Pi$ will also be called pointers.

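We use the 'bar operator' to move from $\Delta$ to $\bar{\Delta}$ and back from $\bar{\Delta}$ to $\Delta$. Hence,
for $p \in \Pi$, $\bar{\bar{p}} = p$. For a string $u = x_1 x_2 \cdots x_n$ with $x_i \in \Pi$, the inverse of $u$ is
the string $\bar{u} = \bar{x}_n \bar{x}_{n-1} \cdots \bar{x}_1$. For $p \in \Pi$, we define

$$\mathbf{p} = \begin{cases} p & \text{if } p \in \Delta \\ \bar{p} & \text{if } p \in \bar{\Delta} \end{cases}$$

i.e., $\mathbf{p}$ is the 'unbarred' variant of $p$. The domain of a string $v \in \Pi^*$ is $dom(v) = \{\mathbf{p} \mid
p \text{ occurs in } v\}$. A legal string is a string $u \in \Pi^*$ such that for each $p \in \Pi$ that
occurs in $u$, $u$ contains exactly two occurrences from $\{p, \bar{p}\}$.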

We define the alphabet $\Theta_\kappa = \{M_i, \bar{M}_i \mid 1 \le i \le \kappa\}$; these symbols denote
the MDSs and their inversions. With each string over $\Theta_\kappa$, we associate a unique
string over $\Pi$ through the homomorphism $\pi_\kappa : \Theta_\kappa^* \to \Pi^*$ defined by:

$\pi_\kappa(M_1) = 2$, $\pi_\kappa(M_\kappa) = \kappa$, $\pi_\kappa(M_i) = i(i+1)$ for $1 < i < \kappa$,
and $\pi_\kappa(\bar{M}_j) = \overline{\pi_\kappa(M_j)}$ for $1 \le j \le \kappa$. A permutation of the string $M_1 M_2 \cdots M_\kappa$,
with possibly some of its elements inverted, is called a micronuclear pattern
since it can describe the MIC form of a gene. String $u$ is realistic if there is a
micronuclear pattern $\delta$ such that $u = \pi_\kappa(\delta)$.

Example 1 The MIC form of the gene that encodes the actin protein in the
stichotrich Sterkiella nova is described by micronuclear pattern

$\delta = M_3 M_4 M_6 M_5 M_7 M_9 \bar{M}_2 M_1 M_8$

(see [9,4]). The associated realistic string is $\pi_9(\delta) = 3\,4\,4\,5\,6\,7\,5\,6\,7\,8\,9\,\bar{3}\,\bar{2}\,2\,8\,9$. □
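The homomorphism can be sketched in Python as follows. The encoding is an assumption made for illustration: an MDS is an integer index, a pointer is an integer, and a negative sign stands for the bar. The pattern below also assumes that the second MDS is the inverted one in the actin pattern of Example 1.

```python
def pi(pattern, kappa):
    """Map a micronuclear pattern to its realistic string.

    `pattern` lists MDS indices; -i stands for the inverted MDS i.
    In the result, a negative integer stands for a barred pointer.
    """
    out = []
    for m in pattern:
        i = abs(m)
        if i == 1:
            image = [2]           # first MDS: only the outgoing pointer 2
        elif i == kappa:
            image = [kappa]       # last MDS: only the incoming pointer kappa
        else:
            image = [i, i + 1]    # incoming pointer i, outgoing pointer i + 1
        if m < 0:
            # image of an inverted MDS: reverse the image and bar each pointer
            image = [-p for p in reversed(image)]
        out.extend(image)
    return out

# Hypothetical encoding of the actin pattern, with MDS 2 inverted.
actin = [3, 4, 6, 5, 7, 9, -2, 1, 8]
realistic = pi(actin, 9)
```

Under this encoding the result lists the sixteen pointer occurrences of the realistic string in order, with the two barred pointers appearing as -3 and -2.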

Note that every realistic string is legal, but a legal string need not be realistic. 
For example, a realistic string cannot have 'gaps' (missing pointers): thus 2244 
is not realistic while it is legal. It is also easy to produce examples of legal 
strings which do not have gaps but still are not realistic — 3322 is such an 
example. For a pointer $p$ and a legal string $u$, if both $p$ and $\bar{p}$ occur in $u$ then
we say that both $p$ and $\bar{p}$ are positive in $u$; if on the other hand only $p$ or only $\bar{p}$
occurs in $u$, then both $p$ and $\bar{p}$ are negative in $u$. So, every pointer occurring in
a legal string is either positive or negative in it. A nonempty legal string with 
no proper nonempty legal substrings is called elementary. For example, the 
legal string 234324 is elementary, while the legal string 234342 is not (because 
3434 is a proper legal substring). 

Definition 2 Let $u = x_1 x_2 \cdots x_n$ be a legal string with $x_i \in \Pi$ for $1 \le i \le n$.
For a pointer $p \in \Pi$ such that $\{x_i, x_j\} \subseteq \{p, \bar{p}\}$ and $1 \le i < j \le n$, the
$p$-interval of $u$ is the substring $x_i x_{i+1} \cdots x_j$. Two distinct pointers $p, q \in \Pi$
overlap in $u$ if the $p$-interval of $u$ overlaps with the $q$-interval of $u$.
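The notions of legal string, positive and negative pointer, and overlap can be checked mechanically. The following is a minimal sketch under an assumed encoding (a string is a list of non-zero integers, and a negative integer stands for a barred pointer); the function names are illustrative only.

```python
from collections import Counter

def is_legal(u):
    """Every pointer occurring in u occurs exactly twice, up to bars."""
    return all(c == 2 for c in Counter(abs(p) for p in u).values())

def is_positive(u, p):
    """p is positive in u if both p and its barred form occur in u."""
    return abs(p) in u and -abs(p) in u

def interval(u, p):
    """Positions (i, j), i < j, of the two occurrences of p or its bar."""
    i, j = [k for k, x in enumerate(u) if abs(x) == abs(p)]
    return i, j

def overlap(u, p, q):
    """Distinct pointers overlap if their intervals properly interleave."""
    (i1, j1), (i2, j2) = interval(u, p), interval(u, q)
    return i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1
```

For instance, in `[2, 3, -2, 3]` pointer 2 is positive, pointer 3 is negative, and the two overlap, whereas in `[2, 3, 3, 2]` the 3-interval is nested inside the 2-interval and the pointers do not overlap.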

The string pointer reduction system consists of three types of reduction rules
operating on legal strings. For all $p, q \in \Pi$ with $\mathbf{p} \ne \mathbf{q}$:

• the string negative rule for $p$ is defined by $\mathbf{snr}_p(u_1 p p u_2) = u_1 u_2$,

• the string positive rule for $p$ is defined by $\mathbf{spr}_p(u_1 p u_2 \bar{p} u_3) = u_1 \bar{u}_2 u_3$,

• the string double rule for $p, q$ is defined by $\mathbf{sdr}_{p,q}(u_1 p u_2 q u_3 p u_4 q u_5) = u_1 u_4 u_3 u_2 u_5$,

where $u_1, u_2, \ldots, u_5$ are arbitrary strings over $\Pi$.
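The three rules translate directly into string manipulations. The sketch below assumes strings are lists of non-zero integers with a negative sign standing for the bar, and it returns `None` when a rule is not defined on its argument; both conventions are choices made here for illustration.

```python
def inverse(u):
    """Inverse of a string: reverse it and bar every pointer."""
    return [-x for x in reversed(u)]

def snr(u, p):
    """String negative rule: u1 p p u2 -> u1 u2."""
    for i in range(len(u) - 1):
        if u[i] == p and u[i + 1] == p:
            return u[:i] + u[i + 2:]
    return None

def spr(u, p):
    """String positive rule: u1 p u2 pbar u3 -> u1 inverse(u2) u3."""
    if p in u and -p in u:
        i, j = u.index(p), u.index(-p)
        if i < j:
            return u[:i] + inverse(u[i + 1:j]) + u[j + 1:]
    return None

def sdr(u, p, q):
    """String double rule: u1 p u2 q u3 p u4 q u5 -> u1 u4 u3 u2 u5."""
    pos_p = [k for k, x in enumerate(u) if x == p]
    pos_q = [k for k, x in enumerate(u) if x == q]
    if len(pos_p) == 2 and len(pos_q) == 2 and abs(p) != abs(q):
        (a, c), (b, d) = pos_p, pos_q
        if a < b < c < d:
            return u[:a] + u[c + 1:d] + u[b + 1:c] + u[a + 1:b] + u[d + 1:]
    return None
```

Each rule removes both occurrences of the pointers it is applied to, so every application shortens the string by two (negative and positive rule) or four (double rule) symbols.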

Note that each of these rules is defined only on legal strings that satisfy the 
given form. For example, $\mathbf{snr}_2$ is not defined on legal string 2323. It is important
to realize that for every non-empty legal string there is at least one
reduction rule applicable. Indeed, every legal string for which no string positive 
rule and no string double rule is applicable must have only nonoverlapping, 
negative pointers and thus a string negative rule is applicable. 

We also define $Snr = \{\mathbf{snr}_p \mid p \in \Pi\}$, $Spr = \{\mathbf{spr}_p \mid p \in \Pi\}$ and $Sdr =
\{\mathbf{sdr}_{p,q} \mid p, q \in \Pi, \mathbf{p} \ne \mathbf{q}\}$ to be the sets containing all the reduction rules of a
specific type.

The string negative rule corresponds to the loop recombination operation, the 
string positive rule corresponds to the hairpin recombination operation, and 
the string double rule corresponds to the double-loop recombination operation. 
Note that the fact (pointed out at the end of Section 2) that the molecular 
operations remove pointers is explicit in the string pointer reduction system:
indeed, when a string rule for a pointer $p$ (or pointers $p$ and $q$) is applied,
then all occurrences of $p$ and $\bar{p}$ (or $p$, $\bar{p}$, $q$ and $\bar{q}$) are removed.

Definition 3 The domain $dom(\rho)$ of a reduction rule $\rho$ equals the set of
unbarred variants of the pointers the rule is applied to, i.e., $dom(\mathbf{snr}_p) =
dom(\mathbf{spr}_p) = \{\mathbf{p}\}$ and $dom(\mathbf{sdr}_{p,q}) = \{\mathbf{p}, \mathbf{q}\}$ for $p, q \in \Pi$. For a composition
$\varphi = \varphi_1 \varphi_2 \cdots \varphi_n$ of reduction rules $\varphi_1, \varphi_2, \ldots, \varphi_n$, the domain $dom(\varphi)$ is the
union of the domains of its constituents, i.e., $dom(\varphi) = dom(\varphi_1) \cup dom(\varphi_2) \cup
\cdots \cup dom(\varphi_n)$.

Definition 4 Let $u$ and $v$ be legal strings and $S \subseteq \{Snr, Spr, Sdr\}$. Then a
composition $\varphi$ of reduction rules from $S$ is called an ($S$-)reduction of $u$, if $\varphi$ is
applicable to (defined on) $u$. A successful reduction $\varphi$ of $u$ is a reduction of $u$
such that $\varphi(u) = \lambda$. We then also say that $\varphi$ is successful for $u$. We say that
$u$ is reducible to $v$ in $S$ if there is an $S$-reduction $\varphi$ of $u$ such that $\varphi(u) = v$.
We simply say that $u$ is reducible to $v$ if $u$ is reducible to $v$ in $\{Snr, Spr, Sdr\}$.
We say that $u$ is successful in $S$ if $u$ is reducible to $\lambda$ in $S$.

Note that if $\varphi$ is a reduction of $u$, then $dom(\varphi) = dom(u) \setminus dom(\varphi(u))$. Because
(as pointed out already) for every non-empty legal string there is at least one 
reduction rule applicable, we easily obtain Theorem 9.1 in [4] which states 
that every legal string is successful in {Snr, Spr, Sdr}. 
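Because every rule shortens the string, reducibility can be decided by exhaustive search on small inputs. The sketch below is a brute-force illustration, not an algorithm from the paper; it assumes strings are sequences of non-zero integers with a negative sign standing for the bar, and it treats the empty composition as a reduction, so every string counts as reducible to itself.

```python
def successors(u, kinds=("snr", "spr", "sdr")):
    """All strings obtainable from the tuple u by one rule of the given kinds."""
    n, out = len(u), set()
    if "snr" in kinds:
        for i in range(n - 1):
            if u[i] == u[i + 1]:                    # u1 p p u2 -> u1 u2
                out.add(u[:i] + u[i + 2:])
    if "spr" in kinds:
        for i in range(n):
            for j in range(i + 1, n):
                if u[i] == -u[j]:                   # u1 p u2 pbar u3
                    mid = tuple(-x for x in reversed(u[i + 1:j]))
                    out.add(u[:i] + mid + u[j + 1:])
    if "sdr" in kinds:
        for a in range(n):
            for b in range(a + 1, n):
                for c in range(b + 1, n):
                    for d in range(c + 1, n):
                        if u[a] == u[c] and u[b] == u[d]:   # p .. q .. p .. q
                            out.add(u[:a] + u[c + 1:d] + u[b + 1:c]
                                    + u[a + 1:b] + u[d + 1:])
    return out

def reducible(u, v, kinds=("snr", "spr", "sdr")):
    """Exhaustive search: is u reducible to v using rules of the given kinds?"""
    start, goal = tuple(u), tuple(v)
    seen, stack = {start}, [start]
    while stack:
        w = stack.pop()
        if w == goal:
            return True
        for x in successors(w, kinds) - seen:
            seen.add(x)
            stack.append(x)
    return False
```

With all three rule types enabled, `reducible(u, [])` should hold for every legal `u`, in line with the statement above; restricting `kinds` shows which strings need a particular operation.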

Example 5 Let $S = \{Snr, Spr\}$, $u = 32454532$, and $v = 5454$. Then $u$
is reducible to $v$ in $S$, because $(\mathbf{snr}_3\, \mathbf{spr}_2)(u) = v$. Since applying $\varphi =
\mathbf{spr}_5\, \mathbf{spr}_4\, \mathbf{snr}_3\, \mathbf{spr}_2$ to $u$ yields $\lambda$, $\varphi$ is successful for $u$. On the other hand,
$u = 3232$ is not reducible to any $v$ in $S$, because none of the rules in $Snr$ and
none of the rules in $Spr$ is applicable for this $u$. □

Referring to the Introduction, in Theorem 18 we present a characterization of 
the intermediate strings that may be constructed during the transformation 
of a given gene from its micronuclear form to its macronuclear form. Formally, 
this is a characterization of reducibility, which allows one to determine for 
any given legal strings $u$ and $v$ and $S \subseteq \{Snr, Spr, Sdr\}$, whether or not $u$ is
reducible to v in S. This result can be seen as a generalization of the results 
from Chapter 13 in [4], which provide a characterization of successfulness for 
realistic strings, that is, for the case where u is realistic and v = A. 

5 Pointer Removal Operation



Let $\varphi$ be a reduction of a legal string $u$. If we let $u'$ be the legal string obtained
from $u$ by deleting all pointers from $\Pi_{dom(\varphi(u))}$, then it turns out that $\varphi$ is also
a reduction of $u'$. In fact, $\varphi$ is a successful reduction of $u'$. This is formalized
in Theorem 10, and thus it states a necessary condition for reducibility. In the 
following sections we will strengthen Theorem 10 to obtain a characterization 
of reducibility. 

Definition 6 For a subset $D \subseteq \Delta$, the $D$-removal operation, denoted by
$rem_D$, is defined by $rem_D = erase_{D \cup \bar{D}}$. We also refer to $rem_D$ operations, for
all $D \subseteq \Delta$, as pointer removal operations.

Example 7 Let $u = 32454532$ and $D = \{4, 5\}$. Then $rem_D(u) = 3232$. Note
that $2, 3 \notin D$. Note also that $\varphi = \mathbf{snr}_3\, \mathbf{spr}_2$ is applicable to both $u$ and
$rem_D(u)$, but for $rem_D(u)$, $\varphi$ is also successful. □
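Pointer removal is a plain filter on the string. The sketch below assumes strings are lists of non-zero integers with a negative sign standing for the bar; the digits of Example 7 are used without bars, which is an assumption, since barring does not affect which symbols are erased.

```python
def dom(u):
    """Domain of a string: unbarred forms of the pointers occurring in it."""
    return {abs(p) for p in u}

def rem(u, D):
    """D-removal: erase every pointer whose unbarred form lies in D."""
    return [p for p in u if abs(p) not in D]

u = [3, 2, 4, 5, 4, 5, 3, 2]   # digits of Example 7, bars not modelled
D = {4, 5}
removed = rem(u, D)
```

Removal never disturbs the relative order or the barring of the pointers that remain, and the domain of the result is the domain of the original string minus `D`.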

The following easy to verify lemma formalizes the essence of the above exam- 
ple. 

Lemma 8 Let $u$ be a legal string and $D \subseteq dom(u)$. Let $\varphi$ be a composition
of reduction rules.

(1) If $\varphi$ is applicable to $rem_D(u)$ and $\varphi$ does not contain string negative rules,
then $\varphi$ is applicable to $u$.

(2) If $\varphi$ is applicable to $u$ and $dom(\varphi) \subseteq dom(u) \setminus D$, then $\varphi$ is applicable to
$rem_D(u)$.

(3) If $\varphi$ is applicable to both $u$ and $rem_D(u)$, then $\varphi(rem_D(u)) = rem_D(\varphi(u))$.

Note that the first statement of Lemma 8 may not be true when $\varphi$ is allowed to
contain string negative rules. The obvious reason for this is that two identical
occurrences of a pointer $p$ may end up next to each other only if some
pointers in between those occurrences are first removed by $rem_D$. This is
illustrated in the following example.

Example 9 Let $u = 3245453662$, $v = 545466$ and $D = dom(v)$. Then
$rem_D(u) = 3232$. Note that although $\varphi = \mathbf{snr}_3\, \mathbf{spr}_2$ is a successful reduction
of $rem_D(u)$, $\varphi$ is not applicable to $u$. □

The following theorem is an immediate consequence of the previous lemma. 

Theorem 10 Let $S \subseteq \{Snr, Spr, Sdr\}$. For legal strings $u$ and $v$, if $u$ is
reducible to $v$ in $S$ and $D = dom(v)$, then $rem_D(u)$ is successful in $S$.

PROOF. Let $u$ be reducible to $v$ in $S$. Then there is an $S$-reduction $\varphi$
such that $\varphi(u) = v$. By Lemma 8, $\varphi$ is an $S$-reduction of $rem_D(u)$ and
$\varphi(rem_D(u)) = rem_D(\varphi(u)) = rem_D(v) = \lambda$. Hence, $\varphi$ is a successful
$S$-reduction of $rem_D(u)$. □



The proof of the above result observes that any reduction of $u$ into $v$ must be a
successful reduction of $rem_D(u)$ where $D = dom(v)$. Referring to Example 9,
we now note that $u$ is not reducible to $v$, because $rem_D(u)$ has two successful
reductions and neither is applicable to $u$. In fact, there is no $v'$ with $D =
dom(v')$ such that $u$ is reducible to $v'$.



6 Reduction Graphs 

The main purpose of this section is to define the notion of reduction graph. A 
reduction graph represents some key aspects of reductions from a legal string 
u to a legal string v: it provides the additional requirements on u and v to 
make the reverse implication of Theorem 10 hold. In addition, it allows one 
to easily determine the number of string negative rules needed to successfully 
reduce u. We will first define the notion of a 2-edge coloured graph. 

Definition 11 A 2-edge coloured graph is a 7-tuple

$G = (V, E_1, E_2, f, \Psi, s, t)$,

where both $(V, E_1, f, \Psi, s, t)$ and $(V, E_2, f, \Psi, s, t)$ are two-ended graphs. Note
that $E_1$ and $E_2$ are not necessarily disjoint.

The terminology and notation for the two-ended graph carries over to 2-edge
coloured graphs. However, for the notion of isomorphism, care must be taken
that the two sorts of edges are preserved. Thus, if $G = (V, E_1, E_2, f, \Psi, s, t)$
and $G' = (V', E'_1, E'_2, f', \Psi, s', t')$ are 2-edge coloured graphs, then it must hold that
for any isomorphism $\alpha$ from $G$ to $G'$,

$(x, u, y) \in E_i$ iff $(\alpha(x), u, \alpha(y)) \in E'_i$

for all $x, y \in V$, $u \in \Psi^*$ and $i \in \{1, 2\}$.

We say that edges $e_1$ and $e_2$ have the same colour if either $e_1, e_2 \in E_1$ or
$e_1, e_2 \in E_2$, otherwise they have different colours. An alternating walk in $G$
is a walk $\pi = e_1 e_2 \cdots e_n$ in $G$ such that $e_i$ and $e_{i+1}$ have different colours for
$1 \le i < n$. For each edge $e$ with $\ell(e) \in \Pi^*$, we define $(\tau(e), \overline{\ell(e)}, \iota(e))$, denoted
by $\bar{e}$, as the reverse of $e$.

Fig. 6. Part of a genome with three pointer pairs corresponding to the same gene.




Fig. 7. The reduction graph corresponding to the underlying genome. 

We are ready now to define the notion of a reduction graph, the main technical 
notion of this paper. The reduction graph is a 2-edge coloured graph and it is 
defined for a legal string $u$ and a set of pointers $D \subseteq dom(u)$. The intuition
behind it is as follows. 



Figure 6 depicts a part of a genome with three pointer pairs corresponding to 
the same gene g. The reduction graph introduces two vertices for each pointer 
and two special vertices s and t representing the ends. It connects adjacent 
pointers through reality edges and connects pointers corresponding to the same 
pointer pair through desire edges in a way that reflects how the parts will be 
glued after a molecular operation is applied on that pointer. The resulting 
reduction graph is depicted in Figure 7. Thus, every reality edge corresponds 
to a certain DNA segment. If such a DNA segment contains other pointers of 
g, then these pointers form the label of that reality edge. 

By definition a realistic string has a physical interpretation. It shows the 
boundaries of the MDSs, and how these should be recombined (following their 
orientation). Considering a subset of these pointers, we still have the phys- 
ical interpretation, although the other pointers are hidden in the segments. 
Technically, however, removing a subset of the pointers may change a real- 
istic string into a legal one that is no longer realistic or even realizable (by 
renaming pointers we cannot obtain a realistic string). An example of such a 
case is given in the introduction of Section 10. In fact, each legal string has 
a physical interpretation with pointers indicating how parts of the string are 
to be reconnected, cf. Figure 7, where no use is made of any MDS-IES seg- 
mentation. Thus our definition of reduction graph works for legal strings in 
general, rather than only for realistic ones. The intuition of a reduction graph 
is similar to the intuition behind a reality and desire diagram (or breakpoint 
graph) from [7,8]. 

Formally, the reduction graph of legal string $u$ with respect to $D \subseteq dom(u)$
shows how $u$ is reduced to a legal string $v$ with $dom(v) = D$ by any possible
reduction $\varphi$. The vertices of the graph correspond to (two copies of each of)
the pointers that are removed during the reduction (those in $\Pi_{dom(u) \setminus D}$). As
illustrated above, we have two types of edges. The desire edges are unlabelled
and connect the pointer pairs in $\Pi_{dom(u) \setminus D}$, while reality edges connect the
successive pointers in $\Pi_{dom(u) \setminus D}$ and are labelled by the strings over $\Pi_D$ that
are in between these pointers in $u$.

Fig. 8. The part of the reduction graph of the legal string u with respect to D
as defined in Example 13 which involves only reality edges (the vertex labels are
omitted).

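Definition 12 Let $D \subseteq \Delta$ and let $u$ be a legal string, such that $u = \delta_0 p_1 \delta_1 p_2 \cdots p_n \delta_n$
where $\delta_0, \ldots, \delta_n \in \Pi_D^*$ and $p_1, \ldots, p_n \in \Pi_{dom(u) \setminus D}$. The reduction graph of $u$
with respect to $D$, denoted by $\mathcal{R}_{u,D}$, is a 2-edge coloured graph $(V, E_1, E_2, f, \Pi, s, t)$,
where

$V = \{I_1, I_2, \ldots, I_n\} \cup \{I'_1, I'_2, \ldots, I'_n\} \cup \{s, t\}$,
$E_1 = E_{1,r} \cup E_{1,l}$, where
$E_{1,r} = \{e_0, e_1, \ldots, e_n\}$ with $e_i = (I'_i, \delta_i, I_{i+1})$ for $1 \le i \le n-1$, $e_0 = (s, \delta_0, I_1)$, $e_n = (I'_n, \delta_n, t)$,

$E_{1,l} = \{\bar{e} \mid e \in E_{1,r}\}$,

$E_2 = \{(I'_i, \lambda, I_j), (I_i, \lambda, I'_j) \mid i, j \in \{1, 2, \ldots, n\} \text{ with } i \ne j \text{ and } p_i = p_j\} \cup
\{(I_i, \lambda, I_j), (I'_i, \lambda, I'_j) \mid i, j \in \{1, 2, \ldots, n\} \text{ and } p_i = \bar{p}_j\}$, and

$f(I_i) = f(I'_i) = \mathbf{p_i}$ for $1 \le i \le n$.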

The edges of $E_1$ are called the reality edges, and the edges of $E_2$ are called the
desire edges. Note that $E_1$ and $E_2$ are not necessarily disjoint. The components
of $\mathcal{R}_{u,D}$ that do not contain $s$ and $t$ are called cyclic components. When $D = \emptyset$,
we simply refer to $\mathcal{R}_{u,D}$ as the reduction graph of $u$.

Thus the reduction graph is a 'superposition' of two graphs on the same set of
vertices $V$: one graph with edges from $E_1$ (reality edges), and one graph with
edges from $E_2$ (desire edges). The following example should make the notion
of reduction graph clearer.

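Example 13 Let $u = 52688325437746$ be a legal string and $D = \{5, 6, 7, 8\} \subseteq
dom(u)$. Thus, $\{2, 3, 4\} = dom(u) \setminus D$, and

$u = \delta_0\, 2\, \delta_1\, 3\, \delta_2\, 2\, \delta_3\, 4\, \delta_4\, 3\, \delta_5\, 4\, \delta_6$

with $\delta_0 = 5$, $\delta_1 = 688$, $\delta_2 = \lambda$, $\delta_3 = 5$, $\delta_4 = \lambda$, $\delta_5 = 77$ and $\delta_6 = 6$. Notice that
$\delta_0, \delta_1, \ldots, \delta_6 \in \Pi_D^*$. This example corresponds to the situation in Figure 6.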

Fig. 9. The part of the reduction graph of the legal string u with respect to D as
defined in Example 13, where only desire edges are shown (the vertex labels are
omitted). Crossing edges correspond to positive pointers.

Fig. 10. The reduction graph $\mathcal{R}_{u,D}$ as defined in Example 13 (the vertex labels are
omitted).

Fig. 11. The reduction graph of Figure 10 where every vertex (except s and t) is
represented by its label.

The reduction graph $\mathcal{R}_{u,D}$ of $u$ with respect to $D$ is given in Figure 10. It is
the union of the graphs in Figure 8 and Figure 9. Note that for every desire
edge $e$, we represent both $e$ and $\bar{e}$ by a single unlabelled, undirected edge. The
graphs are drawn in a form that closely relates to the linear ordering of u. The 
desire edges that cross correspond to positive pointers, and the desire edges 
that do not cross correspond to negative pointers. 

Since the exact identity of the vertices in a reduction graph is not essential
for the problems considered in this paper (we need only to know, modulo
'bar', which pointer is represented by a given vertex), in order to simplify the
pictorial notation of reduction graphs we will replace the vertices (except for $s$
and $t$) by their labels. Figure 11 gives $\mathcal{R}_{u,D}$ in this way. In this figure we have
reordered the vertices, making it transparent that $\mathcal{R}_{u,D}$ has a single cyclic
component (the figure illustrates why the adjective 'cyclic' was added). □

Note that a reduction graph is an undirected graph in the sense that if $e \in E_1$
($e \in E_2$, resp.) then also $\bar{e} \in E_1$ ($\bar{e} \in E_2$, resp.). If we think of a reduction graph
as an undirected graph by considering edges $e$ and $\bar{e}$ as one undirected edge,
then both $s$ and $t$ are incident to exactly one (undirected) edge, and every
other vertex is incident to exactly two (undirected) edges. As a corollary
to Euler's theorem, a reduction graph has exactly one component that has a
linear structure with $s$ and $t$ as endpoints and possibly one or more components
that have a cyclic structure (the cyclic components). Thus, there is a unique
alternating walk from $s$ to $t$ in every reduction graph.

If a 2-edge coloured graph $G$ has a unique alternating walk from $s$ to $t$, then
this walk is called the reduct of $G$, denoted by $\mathit{red}(G)$. We know now that if
$\mathcal{R}_{u,D}$ is a reduction graph of a legal string $u$ with respect to $D \subseteq \mathit{dom}(u)$,
then the reduct exists. It is then also called the reduct of $u$ to $D$, and denoted
by $\mathit{red}(u, D)$. Since $\mathcal{R}_{u,\mathit{dom}(u)}$ consists of the vertices $s$ and $t$ connected by
a (reality) edge labelled by $u$ (and by $\bar{u}$ in the reverse direction), we have
$\mathit{red}(u, \mathit{dom}(u)) = u$. Also, it is clear that if 2-edge coloured graphs $G_1$ and $G_2$
are isomorphic, then $\mathit{red}(G_1) = \mathit{red}(G_2)$.

Example 14 If we take $u$ and $D$ from Example 13, then

$\mathit{red}(u, D) = \delta_0 \bar{\delta}_2 \bar{\delta}_4 \delta_6 = 56$,

which is easy to see in Figure 11. □



7 Reduction Function 

Before we can prove (in the next section) our main theorem on reducibility, we 
need to define reduction functions. A reduction function operates on reduction 
graphs. As we will see, these functions simulate the effect (up to isomorphism) 
of each of the three string pointer reduction rules on a reduction graph. For a 
vertex label p, the p-reduction function merges edges that form a walk 'over' 
vertices labelled by p and removes all vertices labelled by p. 

Definition 15 For each vertex label $p$, we define the $p$-reduction function $\mathit{rf}_p$,
which constructs for every 2-edge coloured graph $G = (V, E_1, E_2, f, \Psi, s, t)$, the
2-edge coloured graph

$\mathit{rf}_p(G) = (V', (E_1 \setminus E_{rem}) \cup E_{add}, E_2 \setminus E_{rem}, f|_{V'}, \Psi, s, t)$,

with

$V' = \{s, t\} \cup \{v \in V \setminus \{s, t\} \mid f(v) \neq p\}$,

$E_{rem} = \{e \in E_1 \cup E_2 \mid f(\iota(e)) = p \text{ or } f(\tau(e)) = p\}$, and

$E_{add} = \{(\iota(\pi), \ell(\pi), \tau(\pi)) \mid \pi = e_1 e_2 \cdots e_n \text{ with } n > 2 \text{ is an alternating walk in } G \text{ with } f(\iota(\pi)) \neq p,\ f(\tau(\pi)) \neq p, \text{ and } f(\tau(e_i)) = p \text{ for } 1 \le i < n\}$.

Fig. 12. The reduction graph obtained when applying $\mathit{rf}_2$ to the reduction graph of
Figure 11.



Example 16 If we take the reduction graph $\mathcal{R}_{u,D}$ from Example 13, cf. Figure 11,
then $\mathit{rf}_2(\mathcal{R}_{u,D})$ is given in Figure 12. □

It is easy to see that the following property holds for each reduction graph
$\mathcal{R}_{u,D}$ and all $p \in \mathit{dom}(u) \setminus D$:

$\mathit{red}(\mathcal{R}_{u,D}) = \mathit{red}(\mathit{rf}_p(\mathcal{R}_{u,D}))$.

Also, reduction functions commute under composition. Thus, if moreover there
is a $q \in \mathit{dom}(u) \setminus D$ such that $p \neq q$, then

$(\mathit{rf}_q\, \mathit{rf}_p)(\mathcal{R}_{u,D}) = (\mathit{rf}_p\, \mathit{rf}_q)(\mathcal{R}_{u,D})$.

The main property of reduction functions is that they simulate the effect (up to
isomorphism) of each of the three string pointer reduction rules on a reduction
graph.

Theorem 17 Let $u$ be a legal string, let $D \subseteq \mathit{dom}(u)$, and let $\varphi$ be a reduction
of $u$ such that $\mathit{dom}(\varphi) = \{p_1, p_2, \ldots, p_n\} \subseteq \mathit{dom}(u) \setminus D$. Then

$(\mathit{rf}_{p_n} \cdots \mathit{rf}_{p_2}\, \mathit{rf}_{p_1})(\mathcal{R}_{u,D}) \approx \mathcal{R}_{\varphi(u),D}$,

and $\mathit{red}(u, D) = \mathit{red}(\varphi(u), D)$.



PROOF. To prove the first statement, it suffices to prove the cases where
$\varphi = \mathbf{snr}_p$, $\varphi = \mathbf{spr}_p$ and $\varphi = \mathbf{sdr}_{p,q}$ for $p, q \in \Pi_{\mathit{dom}(u) \setminus D}$.

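We first prove the snr case. Assume $\mathbf{snr}_p$ is applicable to $u$. We consider the
general case

$u = u_1 q_1 \delta_1 p p \delta_2 q_2 u_2$

for some $\delta_1, \delta_2 \in \Pi_D^*$, $q_1, q_2 \in \Pi_{\mathit{dom}(u) \setminus D}$ and $u_1, u_2 \in \Pi^*$. In the special case
where $q_1$ ($q_2$, resp.) does not exist, the vertex labelled by $q_1$ ($q_2$, resp.) in the
graphs below equals the source vertex $s$ (target vertex $t$, resp.). We will first
prove that $\mathit{rf}_p(\mathcal{R}_{u,D}) \approx \mathcal{R}_{\mathbf{snr}_p(u),D}$. Because $u = u_1 q_1 \delta_1 p p \delta_2 q_2 u_2$, the reduction
graph $\mathcal{R}_{u,D}$ is

[Diagram: reality edges labelled $\delta_1$ (between $q_1$ and $p$), $\lambda$ (between the two
occurrences of $p$) and $\delta_2$ (between $p$ and $q_2$), together with the desire edges
between the vertices labelled $p$.]

where we omitted the parts of the graph that remain the same after applying
$\mathit{rf}_p$. Now, the graph $\mathit{rf}_p(\mathcal{R}_{u,D})$ is given below.

[Diagram: the vertices labelled $q_1$ and $q_2$ joined by a reality edge labelled $\delta_1 \delta_2$.]

This is clearly the reduction graph of $\mathbf{snr}_p(u) = u_1 q_1 \delta_1 \delta_2 q_2 u_2$ with respect to
$D$. Thus, indeed $\mathit{rf}_p(\mathcal{R}_{u,D}) \approx \mathcal{R}_{\mathbf{snr}_p(u),D}$.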

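We now prove the spr case. Assume $\mathbf{spr}_p$ is applicable to $u$. We may
distinguish three cases, which differ in the number of elements of $\Pi_{\mathit{dom}(u) \setminus D}$ in
between $p$ and $\bar{p}$ in $u$:

(1) $u = u_1 q_1 \delta_1 p \delta_2 \bar{p} \delta_4 q_4 u_3$

(2) $u = u_1 q_1 \delta_1 p \delta_2 q_2 \delta_3 \bar{p} \delta_4 q_4 u_3$

(3) $u = u_1 q_1 \delta_1 p \delta_2 q_2 u_2 q_3 \delta_3 \bar{p} \delta_4 q_4 u_3$

for some $\delta_1, \ldots, \delta_4 \in \Pi_D^*$, $q_1, \ldots, q_4 \in \Pi_{\mathit{dom}(u) \setminus D}$, and $u_1, u_2, u_3 \in \Pi^*$. Note that
we have assumed that $p$ is preceded and that $\bar{p}$ is followed by an element from
$\Pi_{\mathit{dom}(u) \setminus D}$. The special cases where $q_1$ or $q_4$ do not exist can be handled in the
same way as we did for the snr case (by setting them equal to $s$ and $t$, resp.).
In each of the three cases, one can prove that $\mathit{rf}_p(\mathcal{R}_{u,D}) \approx \mathcal{R}_{\mathbf{spr}_p(u),D}$. We will
discuss it in detail only for the third case. The reduction graph $\mathcal{R}_{u,D}$ is

[Diagram: reality edges labelled $\delta_1$ (between $q_1$ and $p$), $\delta_2$ (between $p$ and $q_2$),
$\delta_3$ (between $q_3$ and $p$) and $\delta_4$ (between $p$ and $q_4$), together with the desire
edges between the vertices labelled $p$.]

where we again omitted the parts of the graph that remain the same after
applying $\mathit{rf}_p$. Now, the graph $\mathit{rf}_p(\mathcal{R}_{u,D})$ is given below.

[Diagram: the vertices labelled $q_1$ and $q_3$ joined by a reality edge labelled $\delta_1 \bar{\delta}_3$
($\delta_3 \bar{\delta}_1$ in the reverse direction), and the vertices labelled $q_2$ and $q_4$ joined by a
reality edge labelled $\bar{\delta}_2 \delta_4$ ($\bar{\delta}_4 \delta_2$ in the reverse direction).]

This graph is clearly isomorphic to the reduction graph of

$\mathbf{spr}_p(u) = u_1 q_1 \delta_1 \bar{\delta}_3 \bar{q}_3 \bar{u}_2 \bar{q}_2 \bar{\delta}_2 \delta_4 q_4 u_3$

with respect to $D$. Thus, indeed $\mathit{rf}_p(\mathcal{R}_{u,D}) \approx \mathcal{R}_{\mathbf{spr}_p(u),D}$.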

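Finally, we prove the sdr case. Assume $\mathbf{sdr}_{p,q}$ is applicable to $u$. We only
consider the general case (the other cases are proved similarly):

$u = u_1 q_1 \delta_1 p \delta_2 q_2 u_2 q_3 \delta_3 q \delta_4 q_4 u_3 q_5 \delta_5 p \delta_6 q_6 u_4 q_7 \delta_7 q \delta_8 q_8 u_5$

for some $\delta_1, \ldots, \delta_8 \in \Pi_D^*$, $q_1, \ldots, q_8 \in \Pi_{\mathit{dom}(u) \setminus D}$, and $u_1, \ldots, u_5 \in \Pi^*$. The
reduction graph $\mathcal{R}_{u,D}$ is

[Diagram: reality edges labelled $\delta_1$ (between $q_1$ and $p$), $\delta_2$ (between $p$ and $q_2$),
$\delta_5$ (between $q_5$ and $p$), $\delta_6$ (between $p$ and $q_6$), $\delta_3$ (between $q_3$ and $q$),
$\delta_4$ (between $q$ and $q_4$), $\delta_7$ (between $q_7$ and $q$) and $\delta_8$ (between $q$ and $q_8$),
together with the desire edges between the vertices labelled $p$ and between the
vertices labelled $q$.]

where we omitted the parts of the graph that remain the same after applying
$(\mathit{rf}_q\, \mathit{rf}_p)$. Now, the graph $\mathit{rf}_q(\mathit{rf}_p(\mathcal{R}_{u,D}))$ is given below.

[Diagram: reality edges between $q_1$ and $q_6$ (labelled $\delta_1 \delta_6$, and $\bar{\delta}_6 \bar{\delta}_1$ in the
reverse direction), between $q_2$ and $q_5$ (labelled $\bar{\delta}_2 \bar{\delta}_5$, and $\delta_5 \delta_2$ in the reverse
direction), between $q_3$ and $q_8$ (labelled $\delta_3 \delta_8$, and $\bar{\delta}_8 \bar{\delta}_3$ in the reverse direction),
and between $q_4$ and $q_7$ (labelled $\bar{\delta}_4 \bar{\delta}_7$, and $\delta_7 \delta_4$ in the reverse direction).]

This graph is clearly isomorphic to the reduction graph of

$\mathbf{sdr}_{p,q}(u) = u_1 q_1 \delta_1 \delta_6 q_6 u_4 q_7 \delta_7 \delta_4 q_4 u_3 q_5 \delta_5 \delta_2 q_2 u_2 q_3 \delta_3 \delta_8 q_8 u_5$

with respect to $D$. Thus, indeed $\mathit{rf}_q(\mathit{rf}_p(\mathcal{R}_{u,D})) \approx \mathcal{R}_{\mathbf{sdr}_{p,q}(u),D}$. This proves the
first statement.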

Now, by the fact that the reduction function does not change the reduct of
the graph, and by the first statement, we have

$\mathit{red}(\mathcal{R}_{u,D}) = \mathit{red}((\mathit{rf}_{p_1}\, \mathit{rf}_{p_2} \cdots \mathit{rf}_{p_n})(\mathcal{R}_{u,D})) = \mathit{red}(\mathcal{R}_{\varphi(u),D})$.

Thus, $\mathit{red}(u, D) = \mathit{red}(\varphi(u), D)$ and this proves the second statement. □



8 Characterization of Reducibility 

We are now ready to prove our main theorem on reducibility. In Theorem 10 we
have shown that if $u$ is reducible to $v$ in $S$, then $\mathit{rem}_{\mathit{dom}(v)}(u)$ is successful in $S$.
Here we strengthen this theorem into an iff statement by additionally requiring
that $v$ equals the reduct of $u$ to $\mathit{dom}(v)$. The resulting characterization is
independent of the chosen set of reduction rules $S \subseteq \{Snr, Spr, Sdr\}$.

Theorem 18 Let $u$ and $v$ be legal strings, $D = \mathit{dom}(v) \subseteq \mathit{dom}(u)$ and $S \subseteq
\{Snr, Spr, Sdr\}$. Then $u$ is reducible to $v$ in $S$ iff $\mathit{rem}_D(u)$ is successful in $S$
and $\mathit{red}(u, D) = v$.

PROOF. Let $u$ be reducible to $v$ in $S$. Therefore, there is an $S$-reduction $\varphi$
of $u$ such that $\varphi(u) = v$. Also, $\mathit{rem}_D(u)$ is successful in $S$ by Theorem 10. By
Theorem 17, we have $\mathit{red}(u, D) = \mathit{red}(\varphi(u), D)$. Now, $\mathit{red}(\varphi(u), D) = \varphi(u) =
v$, because $D = \mathit{dom}(\varphi(u))$.

To prove the reverse implication, let $\mathit{rem}_D(u)$ be successful in $S$ and $\mathit{red}(u, D) =
v$. We have to prove that $u$ is reducible to $v$ in $S$. Clearly, there is a successful
$S$-reduction $\varphi$ of $\mathit{rem}_D(u)$.

Assume that $\varphi$ is not applicable to $u$. Since $\varphi$ is applicable to $\mathit{rem}_D(u)$, we
know from Lemma 8 that $\varphi = \varphi_2\, \mathbf{snr}_p\, \varphi_1$ for some $\varphi_1$, $\varphi_2$ and $p$, where $\varphi_1$ is
applicable to $u$ and $\mathbf{snr}_p$ is not applicable to $\varphi_1(u)$. Thus, $p \delta p$ is a substring
of $\varphi_1(u)$ with $\delta \in \Pi_D^* \setminus \{\lambda\}$. Therefore the following graph

[Diagram: two vertices labelled $p$, joined by a reality edge labelled $\delta$ and by a
desire edge.]

must be isomorphic to a cyclic component of the reduction graph $\mathcal{R}_{\varphi_1(u),D}$ of
$\varphi_1(u)$ with respect to $D$. Because $v = \mathit{red}(u, D) = \mathit{red}(\varphi_1(u), D)$ is a legal
string and $\mathit{dom}(v) = D$, the labels of the reality edges of $\mathcal{R}_{\varphi_1(u),D}$ belonging
to cyclic components are empty. This is a contradiction and therefore $\varphi$ is
applicable to $u$. Now, we have $\varphi(u) = \mathit{red}(\varphi(u), D) = \mathit{red}(u, D) = v$, because
$D = \mathit{dom}(\varphi(u))$. Thus, $u$ is reducible to $v$ in $S$. □

Note that the proof of Theorem 18 even proves a stronger fact. The $S$-reduction
$\varphi$ of $u$ with $\varphi(u) = v$ can be taken to be the same as the (successful) $S$-reduction $\varphi$
of $\mathit{rem}_D(u)$. The following corollary follows directly from the previous theorem
and the fact that every legal string is successful in $\{Snr, Spr, Sdr\}$.

Corollary 19 Let $u$ and $v$ be legal strings and $D = \mathit{dom}(v) \subseteq \mathit{dom}(u)$. Then
$u$ is reducible to $v$ iff $\mathit{red}(u, D) = v$.

The previous corollary shows that reducibility can be checked quite efficiently.
Since the reduction graph of a legal string $u$ has $2|u| + 2$ vertices and $8|u| + 4$
edges (counting an undirected desire edge as two (directed) edges), it takes
only linear time $O(|u|)$ to generate $\mathcal{R}_{u,\emptyset}$ using the adjacency lists
representation. Also, generating $\mathcal{R}_{u,D}$ for any $D \subseteq \mathit{dom}(u)$ is of at most the same
complexity as $\mathcal{R}_{u,\emptyset}$. Now, since the walk from $s$ to $t$ does not contain vertices
more than once, it takes only linear time to determine whether $\mathit{red}(u, D) = v$, and
therefore, by the previous corollary, it takes linear time to determine whether
or not $u$ is reducible to $v$.
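
The procedure described above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not an implementation taken from the paper: a legal string is encoded as a list of nonzero integers, a barred pointer is written as a negated integer, and the names `reduct`, `is_reducible`, `inverse` and `is_legal` are hypothetical. The bars in the first sample string of the usage note are likewise chosen only for illustration.

```python
def inverse(w):
    """Inversion of a string: reverse it and toggle the bar of every pointer."""
    return [-x for x in reversed(w)]


def is_legal(w):
    """A string is legal if every pointer (modulo bar) occurs exactly twice."""
    counts = {}
    for x in w:
        counts[abs(x)] = counts.get(abs(x), 0) + 1
    return all(c == 2 for c in counts.values())


def reduct(u, D=frozenset()):
    """Follow the alternating walk from s to t and return red(u, D) as a list.

    Assumes u is legal. Pointers whose absolute value is in D only occur in
    the labels of reality edges; the other pointers become vertices.
    """
    keep = [k for k, x in enumerate(u) if abs(x) not in D]
    p = [u[k] for k in keep]
    n = len(p)
    # deltas[0] precedes the first kept pointer, deltas[i] follows the i-th one.
    bounds = [-1] + keep + [len(u)]
    deltas = [list(u[bounds[i] + 1:bounds[i + 1]]) for i in range(n + 1)]
    if n == 0:
        return deltas[0]
    # Vertex (j, 0) is the left side of kept pointer j, (j, 1) its right side.
    partner, first = {}, {}
    for j, x in enumerate(p):
        if abs(x) not in first:
            first[abs(x)] = j
            continue
        i = first[abs(x)]
        if p[i] == x:   # negative pointer: non-crossing desire edges
            pairs = [((i, 1), (j, 0)), ((i, 0), (j, 1))]
        else:           # positive pointer: crossing desire edges
            pairs = [((i, 0), (j, 0)), ((i, 1), (j, 1))]
        for a, b in pairs:
            partner[a], partner[b] = b, a
    out = list(deltas[0])          # reality edge from s to the first vertex
    v = (0, 0)
    while True:
        j, side = partner[v]       # desire edge
        if side == 1:              # reality edge to the right, read forwards
            out.extend(deltas[j + 1])
            if j == n - 1:
                return out         # reached t
            v = (j + 1, 0)
        else:                      # reality edge to the left, read inverted
            out.extend(inverse(deltas[j]))
            v = (j - 1, 1)


def is_reducible(u, v):
    """Corollary 19: u is reducible to v iff red(u, dom(v)) = v."""
    D = {abs(x) for x in v}
    if not (is_legal(u) and is_legal(v) and D <= {abs(x) for x in u}):
        return False
    return reduct(u, D) == list(v)
```

For instance, `reduct([5, 2, 6, 8, 8, 3, -2, 5, -4, 3, 7, 7, 4, 6], {5, 6, 7, 8})` follows the walk through three desire edges and returns `[5, 6]`, and `is_reducible([2, 2, 5, 5], [5, 5])` holds whereas `is_reducible([2, 5, 2, 5], [5, 5])` does not. Every vertex is visited at most once, so the running time is linear in the length of the string.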

The next corollary illustrates that the function of the reduct is twofold: it
not only determines, given $u$ and $D \subseteq \mathit{dom}(u)$, which legal string is obtained
by applying a reduction $\varphi$ of $u$ with $\mathit{dom}(\varphi(u)) = D$, but also whether or not
there is such a $\varphi$.

Corollary 20 Let $u$ be a legal string and $D \subseteq \mathit{dom}(u)$. Then there is a
reduction $\varphi$ of $u$ with $\mathit{dom}(\varphi(u)) = D$ iff $\mathit{red}(u, D)$ is legal and $\mathit{dom}(\mathit{red}(u, D)) =
D$.



PROOF. We first prove the forward implication. If we let $v = \varphi(u)$, then
$v$ is a legal string, $u$ is reducible to $v$, and $D = \mathit{dom}(v)$. By Corollary 19,
$\mathit{red}(u, D) = v$ and therefore $\mathit{red}(u, D)$ is legal and $\mathit{dom}(\mathit{red}(u, D)) = D$.

We now prove the reverse implication. If we let $v = \mathit{red}(u, D)$, then $v$ is legal
and $\mathit{dom}(v) = D$. By Corollary 19, $u$ is reducible to $v$. □

Example 21 Let $u$ and $D$ be as in Example 13. By Example 14, $\mathit{red}(u, D) =
56$. Therefore by Corollary 20, there is no reduction $\varphi$ of $u$ with $\mathit{dom}(\varphi(u)) =
D$. Thus, there is no reduction $\varphi$ of $u$ with $\mathit{dom}(\varphi) = \{2, 3, 4\}$. □



9 Cyclic Components 

In this section we consider the cyclic components of the 'full' reduction graph
$\mathcal{R}_{u,\emptyset}$ of a legal string $u$. We show that if $\mathbf{snr}_p$ is applicable to $u$ for some pointer
$p$, then the number of cyclic components of $\mathcal{R}_{\mathbf{snr}_p(u),\emptyset}$ is exactly one less than
the number of cyclic components of $\mathcal{R}_{u,\emptyset}$. On the other hand, if either $\mathbf{spr}_p$
or $\mathbf{sdr}_{p,q}$ is applicable to $u$ for some pointers $p$, $q$, then the number of cyclic
components remains the same. Before we state this result (Theorem 25), we
will prepare for its proof by studying some elementary connections between $u$
and the structures in $\mathcal{R}_{u,\emptyset}$. Since all the edges of $\mathcal{R}_{u,\emptyset}$ are labelled $\lambda$, we will
omit the labels of the edges in the figures.

Because desire edges in a reduction graph connect vertices that are of the same
label, for every label $p$, there are exactly 0, 2 or 4 vertices labelled by $p$ in
every cyclic component of a reduction graph. The following lemma establishes
an additional property of the number of vertices of a single label in a cyclic
component.

Lemma 22 Let $u$ be a legal string, and let $P$ be a cyclic component in $\mathcal{R}_{u,\emptyset}$.
Let $p$ ($q$, resp.) be the first (last, resp.) pointer (from left to right) in $u$ such
that there is a vertex in $P$ with label $p$ ($q$, resp.). Then there are exactly two
vertices of $P$ labelled by $p$ and there are exactly two vertices of $P$ labelled by
$q$.

PROOF. Assume that all four vertices labelled by $p$ are in $P$. Then these
vertices are $I_i$, $I'_i$, $I_j$ and $I'_j$ for some $i$ and $j$ with $i < j$. By the definition of
reduction graph, there is a reality edge from vertex $I'_{i-1}$ (from $s$ if $i = 1$) to
vertex $I_i$. But by the definition of $p$, vertex $I'_{i-1}$ cannot belong to $P$ (and neither
can $s$, since $P$ is cyclic), which is a contradiction.
Therefore, there are only two vertices labelled by $p$ in $P$. The second claim is
proved analogously. □

Note that in the previous lemma, $p$ and $q$ need not be distinct. Note also that
if all the vertices of a cyclic component have the same label, then the cyclic
component has exactly two vertices.

Lemma 23 Let $u$ be a legal string, and let $p \in \Pi$. Then $\mathcal{R}_{u,\emptyset}$ has a cyclic
component consisting of exactly two vertices, which are both labelled by $p$, iff
either $pp$ or $\bar{p}\bar{p}$ is a substring of $u$.

PROOF. Let either $pp$ or $\bar{p}\bar{p}$ be a substring of $u$. Then

[Diagram: two vertices labelled $p$, joined by a reality edge and by a desire edge.]

is a cyclic component of $\mathcal{R}_{u,\emptyset}$ consisting of exactly two vertices, both labelled
by $p$.

To prove the forward implication, let $\mathcal{R}_{u,\emptyset}$ have a cyclic component $P$
consisting of exactly two vertices, both labelled by $p$. Clearly, every vertex of a
cyclic component has exactly one incoming and one outgoing edge in each
colour. Because there is a reality edge between the two vertices of $P$, $I'_i$ and
$I_{i+1}$ are the vertices of $P$ for some $i$. Now, since there is a desire edge $(I'_i, \lambda, I_{i+1})$
in $P$, either $p$ or $\bar{p}$ occurs twice in $u$. As reality edges in $\mathcal{R}_{u,\emptyset}$ connect adjacent
pointers in $u$, either $pp$ or $\bar{p}\bar{p}$ is a substring of $u$. □

Lemma 24 Let $u$ be a legal string, let $p$ and $q$ be negative pointers occurring in
$u$. Then $\mathcal{R}_{u,\emptyset}$ has a cyclic component consisting of exactly two vertices labelled
by $p$ and two vertices labelled by $q$ iff either $u = u_1 p q u_2 q p u_3$ or $u = u_1 q p u_2 p q u_3$,
for some strings $u_1, u_2, u_3 \in \Pi^*$.

PROOF. Let either $u = u_1 p q u_2 q p u_3$ or $u = u_1 q p u_2 p q u_3$ for some strings
$u_1, u_2, u_3 \in \Pi^*$. Then

[Diagram: a cycle on two vertices labelled $p$ and two vertices labelled $q$.]

is a cyclic component of $\mathcal{R}_{u,\emptyset}$ consisting of exactly two vertices labelled by $p$
and two vertices labelled by $q$.

To prove the forward implication, let $\mathcal{R}_{u,\emptyset}$ have a cyclic component $P$
consisting of exactly two vertices labelled by $p$ and two vertices labelled by $q$. Since
each cyclic component 'is' a cycle of edges of alternating colour, and since
desire edges connect only vertices with the same label, the component looks like
the figure above. Since reality edges in $\mathcal{R}_{u,\emptyset}$ connect adjacent pointers in $u$
and since $p$ and $q$ are negative, either $u = u_1 p q u_2 q p u_3$ or $u = u_1 p q u_2 p q u_3$ with
$u_i \in \Pi^*$ (with possibly $p$ and $q$ interchanged). Assume that $u = u_1 p q u_2 p q u_3$
(with possibly $p$ and $q$ interchanged). Then there must be vertices $I'_i$ and $I'_j$
labelled by $p$ with a desire edge $(I'_i, \lambda, I'_j)$ in $P$. But this is impossible since
$p$ is negative. Consequently, $u = u_1 p q u_2 q p u_3$ (with possibly $p$ and $q$
interchanged). □

The following theorem states that only the string negative rules can remove 
cyclic components. This is consistent with the fact that only loop recombina- 
tion introduces a new (cyclic) molecule, cf. Figure 3. Clearly, by the definition 
of reduction function, a cyclic component is removed by simply removing its 
vertices and edges and not by merging with another component. 

Theorem 25 Let $u$ be a legal string, let $N$ be the number of cyclic components
of $\mathcal{R}_{u,\emptyset}$, and let $p \in \Pi$ with $p \in \mathit{dom}(u)$.

• If $\mathbf{snr}_p$ is applicable to $u$, then the reduction graph of $\mathbf{snr}_p(u)$ has exactly
$N - 1$ cyclic components.

• If $\mathbf{spr}_p$ is applicable to $u$, then the reduction graph of $\mathbf{spr}_p(u)$ has exactly
$N$ cyclic components.

Now let $q \in \Pi$ with $q \in \mathit{dom}(u)$ and $p \neq q$.

• If $\mathbf{sdr}_{p,q}$ is applicable to $u$, then the reduction graph of $\mathbf{sdr}_{p,q}(u)$ has exactly
$N$ cyclic components.

PROOF. First note that by the definition of reduction function and Theorem 17
the number of cyclic components cannot increase when applying
reduction rules.

Let $\mathbf{snr}_p$ be applicable to $u$. By Lemma 23, $\mathcal{R}_{u,\emptyset}$ has a cyclic component
consisting of exactly two vertices, which are both labelled by $p$. It follows then
from Theorem 17 that the reduction graph of $\mathbf{snr}_p(u)$ has at most $N - 1$ cyclic
components. The other two vertices labelled by $p$ are connected by reality
edges to vertices that are not labelled by $p$, and therefore this component
does not disappear. Hence, the reduction graph of $\mathbf{snr}_p(u)$ has exactly $N - 1$
cyclic components.

Let $\mathbf{spr}_p$ be applicable to $u$. Assume that the reduction graph of $\mathbf{spr}_p(u)$ has
fewer than $N$ cyclic components. Then by Theorem 17, there exists a cyclic
component $P$ of $\mathcal{R}_{u,\emptyset}$ consisting of only vertices labelled by $p$. By Lemma 22,
$P$ consists of only two vertices. By Lemma 23, either $pp$ or $\bar{p}\bar{p}$ is a substring of
$u$ and thus $\mathbf{spr}_p$ is not applicable to $u$. This is a contradiction. Consequently,
the reduction graph of $\mathbf{spr}_p(u)$ has exactly $N$ cyclic components.

Let $\mathbf{sdr}_{p,q}$ be applicable to $u$. Assume that the reduction graph of $\mathbf{sdr}_{p,q}(u)$
has fewer than $N$ cyclic components. Then there exists a cyclic component $P$ in
$\mathcal{R}_{u,\emptyset}$ consisting only of vertices labelled by $p$ and $q$. Assume that all vertices
of $P$ are labelled by $p$. Then, analogously to the previous case, we deduce that
either $pp$ or $\bar{p}\bar{p}$ is a substring of $u$. Thus $\mathbf{sdr}_{p,q}$ is not applicable to $u$. This
is a contradiction. Similarly, $P$ cannot consist only of vertices labelled by $q$.
Assume then that $P$ consists of vertices that are labelled by both $p$ and $q$.
By Lemma 22 and the fact that pointers $p$ and $q$ overlap, there are only two
vertices labelled by $p$ in $P$ and two vertices labelled by $q$ in $P$. By Lemma 24,
either $u = u_1 p q u_2 q p u_3$ or $u = u_1 q p u_2 p q u_3$ for some strings $u_1, u_2, u_3 \in \Pi^*$.
Thus $\mathbf{sdr}_{p,q}$ is not applicable to $u$. This is a contradiction. Therefore, such a
component $P$ cannot exist and so the reduction graph of $\mathbf{sdr}_{p,q}(u)$ has exactly
$N$ cyclic components. □

The previous theorem can be reformulated as follows, yielding a key property 
of reduction graphs. 

Theorem 26 Let N be the number of cyclic components of the reduction 
graph of legal string u. Then every successful reduction of u has exactly N 
string negative rules. 

The Invariant Theorem [6] (and Chapter 12 in [4]) shows that all successful 
reductions of a realistic string u have the same number of string negative rules. 
Therefore, Theorem 26 can be seen as a generalization of this result, since it 
holds for legal strings in general. Indeed, the technical framework used in [6] 
is the MDS descriptor reduction system which is only suited to model realistic 
strings. 

Moreover, Theorem 26 shows that this number $N$ is an elegant graph
theoretical property of the reduction graph. As a consequence, it can be efficiently
obtained. Since it takes $O(|u|)$ time to generate $\mathcal{R}_{u,\emptyset}$, and again $O(|u|)$ time to determine
the number of connected components of $\mathcal{R}_{u,\emptyset}$, the previous theorem implies
that it takes only linear time to determine how many string negative rules are
needed to successfully reduce legal string $u$. Theorem 26 will be used in the
next section, when we characterize successfulness in $S \subseteq \{Spr, Sdr\}$.
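
Counting the cyclic components, as discussed above, can be sketched as follows. This is an illustration under assumptions rather than code from the paper: strings are lists of nonzero integers with a barred pointer written as a negated integer, the input is assumed to be legal, and the function name `cyclic_components` is hypothetical.

```python
def cyclic_components(u):
    """Number of cyclic components of the reduction graph of u (D empty)."""
    n = len(u)
    # The chain s, I_1, I'_1, I_2, I'_2, ..., I_n, I'_n, t in string order;
    # vertex (j, 0) is the left side of position j and (j, 1) its right side.
    chain = ["s"]
    for j in range(n):
        chain += [(j, 0), (j, 1)]
    chain.append("t")
    adj = {v: [] for v in chain}

    def link(a, b):
        adj[a].append(b)
        adj[b].append(a)

    # Reality edges join s to I_1, each I'_i to I_{i+1}, and I'_n to t.
    for k in range(0, len(chain), 2):
        link(chain[k], chain[k + 1])
    # Desire edges join the vertices of the two occurrences of each pointer.
    first = {}
    for j, x in enumerate(u):
        if abs(x) not in first:
            first[abs(x)] = j
            continue
        i = first[abs(x)]
        if u[i] == x:   # negative pointer
            link((i, 1), (j, 0))
            link((i, 0), (j, 1))
        else:           # positive pointer
            link((i, 0), (j, 0))
            link((i, 1), (j, 1))
    # Count connected components with an iterative depth-first search.
    seen, components = set(), 0
    for start in chain:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components - 1   # discard the linear component containing s and t
```

Under this encoding `cyclic_components([2, 2, 3, 3])` is 2, matching the two string negative rules needed to reduce that string, while `cyclic_components([2, 3, 2, 3])` is 0, matching a reduction by a single string double rule.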

Example 27 Let $u = 2\,3\,\bar{2}\,\bar{4}\,3\,4$ be a legal string. The reduction graph of $u$ is
depicted in Figure 11, where $\delta_i = \lambda$ for all $i \in \{0, 1, \ldots, 6\}$. By Theorem 26
every successful reduction of $u$ has exactly one string negative rule. There are exactly
four successful reductions of $u$, these are $\mathbf{snr}_2\, \mathbf{spr}_3\, \mathbf{spr}_{\bar{4}}$, $\mathbf{snr}_{\bar{3}}\, \mathbf{spr}_2\, \mathbf{spr}_{\bar{4}}$,
$\mathbf{snr}_{\bar{3}}\, \mathbf{spr}_{\bar{4}}\, \mathbf{spr}_2$ and $\mathbf{snr}_4\, \mathbf{spr}_{\bar{3}}\, \mathbf{spr}_2$. Notice that each of these reductions has
exactly one string negative rule. □

Fig. 13. The reduction graph of $u = 2233$.
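
A brute-force check of such an enumeration can be sketched as follows. The sketch assumes the usual definitions of the three string pointer rules ($\mathbf{snr}_p(u_1 p p u_2) = u_1 u_2$, $\mathbf{spr}_p(u_1 p u_2 \bar{p} u_3) = u_1 \bar{u}_2 u_3$ and $\mathbf{sdr}_{p,q}(u_1 p u_2 q u_3 p u_4 q u_5) = u_1 u_4 u_3 u_2 u_5$), encodes a barred pointer as a negated integer, and uses hypothetical function names. The placement of the bars in the sample string is an assumption made for illustration.

```python
def inverse(w):
    """Inversion of a string: reverse it and toggle the bar of every pointer."""
    return tuple(-x for x in reversed(w))


def applicable_rules(u):
    """Yield (rule, result) for every rule applicable to the legal string u."""
    pos = {}
    for k, x in enumerate(u):
        pos.setdefault(abs(x), []).append(k)
    for i, j in pos.values():
        if u[i] == u[j] and j == i + 1:      # string negative rule
            yield ("snr", u[i]), u[:i] + u[j + 1:]
        if u[i] == -u[j]:                    # string positive rule
            yield ("spr", u[i]), u[:i] + inverse(u[i + 1:j]) + u[j + 1:]
    for i, k in pos.values():                # string double rule
        for j, l in pos.values():
            if u[i] == u[k] and u[j] == u[l] and i < j < k < l:
                yield (("sdr", u[i], u[j]),
                       u[:i] + u[k + 1:l] + u[j + 1:k] + u[i + 1:j] + u[l + 1:])


def successful_reductions(u):
    """All rule sequences (first applied rule first) that reduce u to lambda."""
    u = tuple(u)
    if not u:
        return [[]]
    found = []
    for rule, v in applicable_rules(u):
        for rest in successful_reductions(v):
            found.append([rule] + rest)
    return found
```

With `reds = successful_reductions((2, 3, -2, -4, 3, 4))`, the list `reds` has four entries and each contains exactly one rule whose first component is `"snr"`. The sequences are listed in order of application, whereas compositions in the text are written right to left. The search is exponential in general and is meant only for checking small examples.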

Remark 28 Results in [5] (and Chapter 13 in [4]) show that a successful
reduction of a realistic string $u$ has at least one string negative rule if the
string has a disjoint cycle. Clearly, the notions of disjoint cycle and (cyclic)
component are related. It is easy to verify that every disjoint cycle of a string
can be found as a connected component of the reduction graph of the string,
although that might be the linear component. As an example, consider the
realistic string $u = \pi_3(M_1 M_2 M_3) = 2233$. This realistic string has three disjoint
cycles {22}, {33}, and {23,32} corresponding to the connected components
of the reduction graph of $u$, see Figure 13. This correspondence is not a
bijection for all legal strings, not even for realistic ones. E.g., realistic string
$u = \pi_3(\overline{M_3} M_1 M_2) = \bar{3}223$ has only a single disjoint cycle {33} whereas its
reduction graph has two components, one linear and one cyclic. Hence, the
number of disjoint cycles cannot be used to characterize the number of string
negative rules present in every successful reduction of $u$. □

It is easy to see that for legal string $u$ and $D \subseteq \mathit{dom}(u)$, $\mathcal{R}_{\mathit{rem}_D(u),\emptyset}$ is
isomorphic to $\mathcal{R}_{u,D}$ modulo the labels of the edges. Now, we have the following
corollary to Theorem 26.

Corollary 29 Let $u$ be a legal string, $D \subseteq \mathit{dom}(u)$, and let $N$ be the number of
cyclic components of $\mathcal{R}_{u,D}$. Then every reduction $\varphi$ of $u$ with $\mathit{dom}(\varphi(u)) = D$
has exactly $N$ string negative rules.



PROOF. Let $\varphi$ be a reduction of $u$ with $\mathit{dom}(\varphi(u)) = D$. Then by Theorem 10,
$\varphi$ is a successful reduction of $\mathit{rem}_D(u)$. Since $\mathcal{R}_{u,D}$ is isomorphic to
$\mathcal{R}_{\mathit{rem}_D(u),\emptyset}$ modulo the labels of the edges, $\mathcal{R}_{\mathit{rem}_D(u),\emptyset}$ has $N$ cyclic
components. By Theorem 26, $\varphi$ has exactly $N$ string negative rules. □

10 Successfulness of Legal Strings



In [5] (and Chapter 13 in [4]) an elementary characterization of the realistic
strings that are successful in any given $S \subseteq \{Snr, Spr, Sdr\}$ is presented. This
is helpful in applying Theorem 18, where reducibility of legal string $u$ into
legal string $v$ is translated into successfulness of $\mathit{rem}_D(u)$ with $D = \mathit{dom}(v)$.
Unfortunately, even when $u$ is a realistic string, $\mathit{rem}_D(u)$ for some $D \subseteq \mathit{dom}(u)$
is not necessarily a realistic string. For example, $u = \pi_5(M_1 M_2 \overline{M_3} M_4 M_5) =
2\,2\,3\,\bar{4}\,\bar{3}\,4\,5\,5$ is realistic, while $\mathit{rem}_{\{4\}}(u) = 2\,2\,3\,\bar{3}\,5\,5$ is not. As a matter of fact, it
can be shown that this legal string is not even realizable, that is, the legal string
cannot be transformed into a realistic string by renaming pointers. Formally,
legal string $v$ is realizable if there exists a homomorphism $h : \Pi \to \Pi$ with
$h(\bar{p}) = \overline{h(p)}$ for all $p \in \Pi$ such that $h(v)$ is realistic. Thus, e.g., 223344 and
223344 are also not realistic.

In this section we generalize the results from [5], and give a characterization
of the legal strings that are successful in any given $S \subseteq \{Snr, Spr, Sdr\}$.
Theorems 32, 33, and 35 are the 'legal counterparts' of Theorems 8, 9, and 6
in [5], respectively. These results are independent of the results in the previous
sections of this paper. On the other hand, Theorems 37, 38, and 40 (the 'legal
counterparts' of Theorems 14, 11, and 13 in [5], respectively) rely heavily on
Theorem 26.

10.1 Trivial Generalizations and Known Results 

In the cases of {Snr, Spr}, {Snr, Sdr}, and {Snr, Spr, Sdr}, the characteri- 
zations from [5] (and Chapter 13 in [4]) and their proofs, although stated in 
terms of realistic strings, are valid for legal strings in general. The results are 
given below for completeness. First we restate Lemma 4 and Lemma 7 from 
[5] respectively, which will be used in our considerations below. 

Lemma 30 Let $u = \alpha v \beta$ be a legal string such that $v$ is also a legal string,
and let $S \subseteq \{Snr, Spr, Sdr\}$. Then $u$ is successful in $S$ iff both $v$ and $\alpha\beta$ are
successful in $S$.

Lemma 31 Let $u$ be an elementary legal string. Then $u$ is successful in $\{Snr, Spr\}$
iff either $u$ contains at least one positive pointer or $u = pp$ for some $p \in \Pi$.

The following result follows directly from Lemma 30 and Lemma 31. It is the 
'legal version' of Theorem 8 in [5], which can be taken almost verbatim. 

Theorem 32 Let $u$ be a legal string. Then $u$ is successful in $\{Snr, Spr\}$ iff
for all legal substrings $v$ of $u$, if $v = v_1 u_1 v_2 \cdots v_j u_j v_{j+1}$, where each $u_i$ is a legal
substring, then $v_1 v_2 \cdots v_{j+1}$ either contains a positive pointer or is successful
in $\{Snr\}$.

The previous theorem can be stated more elegantly in terms of connected
components of the overlap graph of $u$, see [4, p. 141]. Note that the characterization
for the case $\{Snr, Spr\}$ refers to the case of $\{Snr\}$. The latter case does differ from
the realistic characterization in [5], and is treated later.

Theorem 33 Let u be a legal string. Then u is successful in {Snr, Sdr} iff 
all the pointers in u are negative. 

We give now the legal version of Theorem 9.1 in [4] — it is a direct consequence 
of Theorems 32 and 33. Without restrictions on the types of reduction rules 
used, every legal string is successful, cf. the remark below the definition of the 
reduction rules, in Section 4. 

Theorem 34 Every legal string is successful in {Snr, Spr, Sdr}. 
10.2 Non-Trivial Generalizations 

The following theorem is the legal counterpart of Theorem 6 in [5]. It turns 
out to be much less restrictive than the original realistic version. 

Theorem 35 Let u be a legal string. Then u is successful in {Snr} iff u 
consists of negative pointers only and no two pointers overlap in u. 

PROOF. The condition from the statement of the theorem is obviously
necessary, because snr cannot resolve overlapping or positive pointers. We will
now prove that this condition is also sufficient. If no two pointers overlap in $u$,
then there must be a substring $pp$ or $p\bar{p}$ of $u$ for some pointer $p$. If moreover
$u$ consists of negative pointers only, then $pp$ is a substring of $u$. So $\mathbf{snr}_p$ is
applicable to $u$. Now, again no two pointers overlap in legal string $\mathbf{snr}_p(u)$,
and $\mathbf{snr}_p(u)$ consists of negative pointers only. By iteration of this argument
we conclude that $u$ is successful in $\{Snr\}$. □
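
The criterion of Theorem 35 and the iteration used in its proof can be compared in a short sketch. As before, this is only an illustration under assumptions: legal strings are lists of nonzero integers, a barred pointer is a negated integer, and both function names are hypothetical.

```python
def snr_criterion(u):
    """Theorem 35: only negative pointers occur and no two pointers overlap."""
    pos = {}
    for k, x in enumerate(u):
        pos.setdefault(abs(x), []).append(k)
    intervals = list(pos.values())           # two positions per pointer
    if any(u[i] != u[j] for i, j in intervals):
        return False                          # some pointer is positive
    # Pointers overlap when exactly one occurrence of one pointer lies
    # between the two occurrences of the other.
    return not any(i < k < j < l for i, j in intervals for k, l in intervals)


def reduces_by_snr(u):
    """Apply string negative rules for as long as possible, using a stack."""
    stack = []
    for x in u:
        if stack and stack[-1] == x:
            stack.pop()                       # an adjacent pair pp is removed
        else:
            stack.append(x)
    return not stack                          # empty iff u was reduced to lambda
```

Both functions agree on legal strings: `[2, 3, 3, 2, 4, 4]` passes, `[2, 3, 2, 3]` fails because the pointers overlap, and `[2, -2]` fails because the pointer is positive. The stack works because removing adjacent equal pairs in any order leads to the same final string.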

Observe that the {Snr} case is referred to in the characterization of {Snr, Spr} 
in Theorem 32. With the above result we can rephrase the latter result as 
follows. 

Corollary 36 Let $u$ be a legal string. Then $u$ is successful in $\{Snr, Spr\}$ iff
for all legal substrings $v$ of $u$, if $v = v_1 u_1 v_2 \cdots v_j u_j v_{j+1}$, where each $u_i$ is a
legal substring, then, if $v_1 v_2 \cdots v_{j+1}$ consists of negative pointers only, they are
nonoverlapping.

The following result follows directly from Theorem 26; a successful reduction
without string negative rules means that the reduction graph has a single
(linear) connected component.

Theorem 37 Let u be a legal string. Then u is successful in {Spr, Sdr} iff 
the reduction graph of u has no cyclic component. 

Theorem 14 in [5] is the realistic predecessor of this result, but instead of cyclic
components it uses disjoint cycles, cf. Remark 28. The latter notion cannot be
used in the general case: e.g., the legal string $2\,3\,\bar{3}\,2\,4\,\bar{4}$ has no disjoint cycle,
but its reduction graph has one cyclic component. Obviously, the only way
to reduce this string is to apply $\mathbf{spr}_3$ and $\mathbf{spr}_4$ (in either order) and then to
apply $\mathbf{snr}_2$. In particular, the converse of Corollary 13.1 in [4] does not hold.

In the same way as Theorem 37 relates to Theorem 14 in [5], the following 
theorem and lemma relate to Theorem 11 and Lemma 12 from [5], respectively. 

Theorem 38 Let $u$ be a legal string. Then $u$ is successful in $\{Sdr\}$ iff $u$
consists of negative pointers only and $\mathcal{R}_{u,\emptyset}$ has no cyclic component.

PROOF. The forward implication follows directly from Theorem 26 and the
fact that sdr cannot resolve positive pointers. To prove the reverse implication,
let $u$ consist of negative pointers only, and let the corresponding reduction
graph $\mathcal{R}_{u,\emptyset}$ have no cyclic component. By Theorem 37, there is a successful
$\{Spr, Sdr\}$-reduction $\varphi$ of $u$. Since $u$ consists of negative pointers only, $\varphi$ is a
successful $\{Sdr\}$-reduction of $u$ (as applications of string double rules do not
introduce positive pointers). □

Lemma 39 Let $u$ be an elementary legal string. Then $u$ is successful in $\{Spr\}$
iff $u$ contains a positive pointer and $\mathcal{R}_{u,\emptyset}$ has no cyclic component.

PROOF. The forward implication follows directly from Theorem 26. To
prove the reverse implication, let $u$ contain a positive pointer and let $\mathcal{R}_{u,\emptyset}$
have no cyclic component. By Lemma 31, there is a successful $\{Snr, Spr\}$-reduction
$\varphi$ of $u$. By Theorem 26, $\varphi$ is a $\{Spr\}$-reduction of $u$. □

The following result follows directly from Lemmas 30 and 39 — it relates to 
Theorem 13 in [5]. 

Theorem 40 Let $u$ be a legal string. Then $u$ is successful in $\{Spr\}$ iff for
all legal substrings $v$ of $u$, if $v = v_1 u_1 v_2 \cdots v_j u_j v_{j+1}$, where each $u_i$ is a legal
substring, then $v_1 v_2 \cdots v_{j+1}$ either is $\lambda$ or contains a positive pointer and its
reduction graph has no cyclic component.

Similarly to Theorem 32, the previous theorem can be stated in terms of
connected components of the overlap graph of $u$.

Recall that for legal string $u$ and $D \subseteq \mathit{dom}(u)$, $\mathcal{R}_{\mathit{rem}_D(u),\emptyset}$ is isomorphic to
$\mathcal{R}_{u,D}$ modulo the labels of the edges. Then, by Theorems 18 and 37, we have
the following corollary. In this result it is especially apparent that both the
linear component and the cyclic components of reduction graphs reveal crucial
properties concerning reducibility.

Corollary 41 Let $u$ and $v$ be legal strings with $D = \mathit{dom}(v) \subseteq \mathit{dom}(u)$.
Then $u$ is reducible to $v$ in $\{Spr, Sdr\}$ iff $\mathcal{R}_{u,D}$ has no cyclic component and
$\mathit{red}(\mathcal{R}_{u,D}) = v$.



11 Discussion 



This paper introduces the concept of breakpoint graph (or reality and desire 
diagram) into gene assembly models, through the notion of reduction graph. 
The reduction graph provides surprisingly valuable insights into the gene as- 
sembly process. First, it allows one to characterize which gene patterns can 
occur during the transformation of a given gene from its MIC form to its 
MAC form. Formally, in the string pointer reduction system we characterize 
whether a legal string u is reducible to a legal string v for a given set of reduc- 
tion rule types. The characterization is independent from the chosen subset of 
the three types of string pointer rules, and it allows us to determine whether a 
legal string u is reducible to a legal string v in linear time. This generalizes the 
characterization of successfulness in [5], since the reduced string need not be 
the empty string. Secondly, the reduction graph allows one to determine the 
number of loop recombination operations that are necessary in the transfor- 
mation of a given gene from its MIC form to its MAC form. This result allows 
for a second generalization of the characterization of successfulness, since we 
consider legal strings instead of realistic strings. 

Reduction graphs are defined for legal strings, the basic notion of the string
pointer reduction system that represents the genes. Future research could focus
on the possibility of defining a similar notion for overlap graphs, which are
used in the graph pointer reduction system, a model (almost) equivalent
to the string pointer reduction system. This would allow results in this paper
to be carried over to the graph pointer reduction system.